Abstract. The experience of using applied containerized biomedical software tools in a cloud environment is
summarized. The reproducibility of scientific computing in relation to modern technologies of scientific calculation
is discussed. The main approaches to biomedical data preprocessing and integration within the framework of the
intelligent analytical system are described. Under pandemic conditions, the success of a health care system depends
significantly on the regular implementation of effective research tools and population monitoring. The earlier the
risks of disease can be identified, the more effective preventive measures or treatments can be. This publication
describes the creation of a prototype of such a tool within the project «Development of methods, algorithms and
intelligent analytical system for processing and analysis of heterogeneous clinical and biomedical data to improve the
diagnosis of complex diseases» (M/99-2019, M/37-2020 with support of the Ministry of Education and Science of
Ukraine), implemented by the V.M. Glushkov Institute of Cybernetics, National Academy of Sciences of Ukraine,
together with the United Institute of Informatics Problems, National Academy of Sciences of Belarus (FI9UKRG-005 with
support of the Belarussian Republican Foundation for Fundamental Research). Insurers entering the market can insure
mostly low risks, prompting consumers (policyholders) to change insurers more often and shifting the overall health
insurance market. Socio-demographic variables can serve as risk adjusters. Since age and gender have relatively small
explanatory power, other socio-demographic variables were studied: marital status, retirement status, disability
status, educational level and income level. Because insurers have an interest in beneficial diagnoses for their
policyholders, they also have an interest in how the relevant information is interpreted (upcoding): insurers can
encourage their policyholders to consult doctors more often in order to collect as many diagnoses as possible. Many
countries and health care systems use diagnostic information to determine the reimbursement of service providers,
which makes the necessary data available. To process and analyze these data, software implementations are being
developed for constructing classifiers, selecting informative features and processing heterogeneous medical and
biological variables for scientific research in clinical medicine. In particular, attention is paid to the
containerization of biomedical applications (Docker and Singularity containerization technologies), which makes the
conditions under which calculations took place reproducible (the software, including libraries, stays unchanged); to
technologies of software pipelining, which allow calculations to be organized as a flow; and to technologies for
parameterizing the software environment, which allow an identical computing environment to be reproduced when
necessary. Experience with the developed linear classifier, gained while testing it on artificial and real data,
points to several advantages of the containerized form of the application: it provides access to real data located in
a cloud environment; calculations for research problems can be performed on cloud resources both with the developed
tools and with cloud services; this form of research organization makes numerical experiments reproducible, i.e. any
other researcher can compare the results of their own developments on specific data that have already been studied by
others, in order to verify the conclusions and the technical feasibility of new results; and the developed tools can
be used on devices of various classes, from a personal computer to a powerful cluster.

Keywords: classifier; cloud service; containerized application; gene expression data; isolated software 
environment; reproducibility of calculations; biomarker. 


Introduction 

This publication summarizes the experience
of using applied containerized software tools
in a cloud environment, which the authors
gained during the project «Development of
methods, algorithms and intelligent analytical
system for processing and analysis of
heterogeneous clinical and biomedical data in
order to improve the diagnosis of complex
diseases», carried out by teams from the
United Institute of Informatics Problems of the
NAS of Belarus and the V.M. Glushkov Institute
of Cybernetics of the NAS of Ukraine. The
main approaches and software tools for
developing the intelligent analytical system
are described.


Problem formulation 

Under pandemic conditions, the success of a
health care system depends significantly on
the regular implementation of effective
research tools and population monitoring
[1-2]. Decisions on which people's lives
depend are regularly made not only by
individuals but also by the legislative and
executive institutions that carry out the
functions of the state health care system.
These decisions take into account the
possibility of preserving and prolonging human
life with scarce resources (for example,
financial, human and time resources). Such
decisions are also made by government
institutions in carrying out the functions of
defence and security, law and order,
macroeconomic management and protection of
property rights.

Analysis of recent research and
publications

Countries with a national health service or
national health insurance usually allow
government agencies to decide on orders for
new products (pharmaceuticals, therapies and
medical devices). As a rule, innovations that
promote therapeutic treatment with a lower
probability of early death within a certain
risk group predominate [3]. Because such
innovations involve additional costs,
cost-cutting innovations are often neglected.

For instance, providing a multimillion- 
dollar mobile coronary unit can help treat pa- 
tients with heart attacks quickly, significantly 
reducing the number of lethal cases on the 
way to a hospital. The long-term drug therapy 
for patients with hypertension who use anti- 
hypertensive drugs can also prevent heart at- 
tacks, significantly supporting the economy of 
research and development (R&D) in pharma- 
ceuticals. The installation of dialysis equip- 
ment for patients with chronic renal failure 
promotes R&D in manufacturing medical 
equipment. 

Goal of the research 

The earlier the risks of disease can be
identified, the more effective preventive
measures or treatments can be [4]. Life-saving
costs are borne not only in health care. In
transport, locations with a higher number of
road accidents raise the issue of improving
the quality of roads (not only the road
surface), which must be addressed by local
communities and government agencies; there is
also the issue of properly arranging roads
within residential areas in order to reduce
vehicle speeds and to conduct permanent video
surveillance. Of course, responding to these
issues in practice involves certain
expenditures from the local or state budget.

Main results 

In the field of environmental protection,
there are questions about the adequacy of
safety systems at dangerous facilities such as
nuclear power plants or chemical plants; if
the level of safety is insufficient, an
accident can occur that threatens the lives of
millions of people. One of the consequences of
the 1986 Chornobyl disaster was an increase in
cancer cases, especially in Ukraine and
Belarus. For coal-burning thermal power
plants, there are questions about the cost of
filters that can capture sulfur dioxide and
other harmful emissions before they reach the
atmosphere. Such emissions increase the
incidence of respiratory diseases in the
population.

In all of the above issues, government
institutions cannot make rational decisions
without a comprehensive and accurate
assessment of the future gains (and losses)
caused by a particular project, and without
comparing such gains with the present value of
the cost flow associated with the project. It
is important for decision makers to measure
gains and costs in the same units. Since
project costs are usually measured in monetary
terms, it makes sense to measure all gains in
monetary terms as well. Therefore, the
prolongation of life or improvement of human
health caused by a project should also be
measured in monetary units. Since it is
difficult to assess the health and life of a
human being in monetary units, economists have
developed alternative methods for assessing
the state of health and human life.

Different approaches to economic health 
assessment compare the benefits of medical 
intervention with the costs of this intervention. 
Gains from intervention can be measured by 
physical units on a one-dimensional scale, mo- 
netary units, units of cardinal utility function 
reflecting the multidimensional concept of 
health in a scalar index. 

Since the 1990s, several states have taken
steps to increase competition among their
health care insurers, hoping to improve
efficiency in health insurance and health
care. In a competitive market, the equality of
price and marginal cost means that competing
health insurers will charge a high premium for
high risks and a low premium for low risks:
high risks are characterized by a relatively
high expected cost of treatment due to the
high probability of disease. As the state
wants all its citizens to have health
insurance, issues of risk selection arise in
health insurance markets.

One way to ensure universal access to health
insurance is to provide targeted subsidies to
the poorer strata of the population to cover
insurance premiums. In practice, governments
regulate premiums, effectively eliminating the
dependence of the premium charged by an
insurer on risk: in the United States, for
example, premium regulation takes the form of
so-called community rating. In addition, the
German and Swiss regulators typically require
insurers to follow an open enrollment policy
and accept all applications. In the United
States, Medicare gives its beneficiaries a
choice between the Medicare plan itself and
competing health care plans, which receive a
capitation payment for every policyholder.

Therefore, in the countries mentioned, there
is a natural incentive for risk selection. If
each person pays the same insurance premium,
the insurer will expect losses on high-risk
individuals and gains on low-risk individuals.
The economic viability of any health insurer
presumes a sufficient number of low-risk
persons insured, so insurers try to attract as
many such persons as possible. Therefore,
under the pressure of competition, all
insurers will engage in cream-skimming,
attracting favorable risks and avoiding
adverse risks.

Risk selection can take many forms. On the one
hand, health insurers can implement direct
risk selection by influencing who signs the
insurance contract: for example, an insurer
may ignore an application from a high-risk
person. Individuals who are likely to need
some medical care may be asked to sign a
contract that provides additional discount
services or outright payments. On the other
hand, indirect risk selection is the
development of payment packages or contracting
with service providers in a way that involves
low-risk individuals but not high-risk
persons. Direct risk selection concerns the
problem of individual access to a service, and
indirect risk selection concerns the problem
of quality.

ISSN 2710-1673. Штучний інтелект, 2020, № 3

Both forms of risk selection occur only when
insurers or their consumers possess
information about individual health care
costs. Direct risk selection requires insurers
to be able to observe characteristics of
individuals that correlate with their expected
costs: gender, age, social behaviour, and so
on. For instance, if healthy people use the
Internet more often, a risk selection strategy
is to market insurance contracts online; in
this case people do not have to know their own
type of risk. In indirect risk selection,
however, people do need to know their type of
risk: for example, the likelihood that they
will use certain services. Such personal data
allow insurers to develop payment packages and
engage service providers for different types
of risk.

Direct and indirect risk selection can 
take place simultaneously: measures that ex- 
clude one selection should not affect another. 
For instance, if the benefit package is strictly 
regulated, preventing indirect risk selection, 
insurers may remain interested in attracting fa- 
vorable risks and thus turn to another risk se- 
lection — direct risk selection. On the contrary, 
if insurers do not have the ability to select 
risks directly, they retain the incentive to deve- 
lop a benefit package that attracts low risks 
and avoids high risks. Indirect risk selection is 
closely related to the phenomenon of unfavo- 
rable (adverse) selection in insurance markets, 
which happens when policyholders have more 
information about their type of risk in compa- 
rison with their insurers. This phenomenon 
takes place regardless of the actions of state. 
At the same time, indirect risk selection is an 
implication of state regulations for premiums. 

To avoid unwanted behavior by insurers 
in selecting risks, certain measures can be 
taken based on the assumption of compulsory 
health insurance, forcing them to cover high 
risks by means of low risks. 

First, open enrollment guarantees that some
insurers will take some high risks. At the
same time, legislation, regulation and
reporting may prevent obvious opportunities
for direct risk selection: for example, the
law may limit the insurer's financial and
other benefits from taking low risks.

Second, the measure against indirect 
risk selection is the regulation of benefit pack- 
age. On the one hand, lower bounds of bene- 
fits can be envisaged, forcing insurers to offer 
benefits that are important for high risks (say, 
for the treatment of different types of dia- 
betes). On the other hand, upper bounds of 
payments may prevent insurers from including 
low-risk services (say, fitness center services) 
in their contracts. In addition, certain types of 
payments, that are convenient for risk selec- 
tion, can be regulated by separate provisions. 
However, the payment package includes sup- 
ply of services from specific partners provided 
for in the contract (say, subcontractors), which 
may be selected by the insurer in question. 
Such selection is especially important in Ma- 
naged Care: for example, by involving many 
sports medicine professionals, the insurer can 
count on the attention of healthy lifestyle ad- 
vocates (low-risk consumers). 

Third, creating incentives through additional
payments to insurers who take high risks and
financial sanctions on insurers who skim the
cream (favorable risks) is called a risk
adjustment scheme (RAS). These payments depend
on observed characteristics such as age and
gender. Reimbursing a share of the actual
costs of medical treatment is called a cost
reimbursement scheme (CRS). The idea of the
CRS is to reduce the gains from risk selection
by decreasing the impact of costs on insurers'
profits. At the same time, the CRS reduces
insurers' incentives to control their costs.
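
The effect of the two schemes can be illustrated with a small arithmetic sketch. All figures, names and the two-type risk structure below are hypothetical and chosen only for illustration; they are not taken from the project or from any real insurance system. The sketch computes an insurer's expected profit per policyholder under a uniform (community-rated) premium, first without any scheme, then with a RAS transfer equal to the deviation of expected cost from the mean, then with a CRS that reimburses half of actual costs.

```python
def expected_profit(premium, expected_cost, ras_transfer=0.0, crs_share=0.0):
    """Expected profit per policyholder (hypothetical toy model).

    premium       -- uniform premium paid by every policyholder
    expected_cost -- expected health care expenditure of the policyholder
    ras_transfer  -- risk adjustment payment to the insurer (negative = levy)
    crs_share     -- share of actual costs reimbursed to the insurer
    """
    if not 0.0 <= crs_share <= 1.0:
        raise ValueError("crs_share must lie in [0, 1]")
    return premium + ras_transfer - (1.0 - crs_share) * expected_cost


def selection_gain(profits):
    """Extra profit from insuring a low-risk instead of a high-risk person."""
    return profits["low"] - profits["high"]


# Hypothetical numbers: two risk types and one uniform premium.
PREMIUM = 3000.0
EXPECTED_COST = {"low": 1000.0, "high": 5000.0}
MEAN_COST = sum(EXPECTED_COST.values()) / len(EXPECTED_COST)

no_scheme = {r: expected_profit(PREMIUM, c) for r, c in EXPECTED_COST.items()}

# RAS: the insurer receives (or pays) the deviation from the mean cost.
with_ras = {
    r: expected_profit(PREMIUM, c, ras_transfer=c - MEAN_COST)
    for r, c in EXPECTED_COST.items()
}

# CRS: half of actual costs are reimbursed (financing is not modelled here).
with_crs = {
    r: expected_profit(PREMIUM, c, crs_share=0.5)
    for r, c in EXPECTED_COST.items()
}

for label, profits in (("none", no_scheme), ("RAS", with_ras), ("CRS", with_crs)):
    print(label, profits, "selection gain:", selection_gain(profits))
```

In this toy setting the RAS removes the gain from selecting low risks entirely, while the CRS only halves it and, by covering part of the costs, also weakens the incentive to control them.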

The RAS and CRS can be substantiated by
modeling risk selection. First, because
insurers may, for various reasons, differ in
their terms of insurance for the population,
the RAS and CRS can create a competitive
system in which a favorable risk structure
does not give an insurer a starting advantage.
Besides, the health insurance market may be
destabilized as new insurers enter the market
and move from high to low risks. The RAS and
CRS can reduce differences in premiums between
insurers, thereby reducing incentives for such
movement (transition).

Insurers entering the market can insure mostly
low risks, prompting consumers (policyholders)
to change insurers more often and shifting the
overall health insurance market. Because
insurers that entered the market earlier are
left with high risks, they eventually have to
increase their premiums or file for
bankruptcy. In such circumstances, insurers
will have no incentive to invest in providing
effective payments.

Indeed, there is evidence of higher mobility
of low risks in the German health insurance
market, based on a comparison of the health
care expenditure (HCE) of those who change
insurers and those who do not: depending on
the age category, people who changed insurers
had on average 45-85% lower HCE than those who
did not. Studies based on the German
Socio-Economic Panel have shown that adults
who remained loyal to their insurer had
significantly worse health status than those
who changed insurers. In the United States,
there is the case of Harvard University's
decision to increase employees' contributions
to insurance premiums if they did not choose
the cheapest option (a Health Maintenance
Organization (HMO) plan).

Risk types became apparent within a year:
those who switched from the most expensive
insurance plans to HMOs had a mean age of 46
years and an HCE 9% above the overall average;
those who remained on the expensive plans had
a mean age of 50 years and an HCE 16% above
the overall average. The rapid loss of low
risks by the broad insurance plans forced the
experiment to stop.

Thus, the RAS and CRS can help ensure a level
playing field during the transition to a
competitive market and stabilize the health
insurance market. In the absence of schemes
such as the RAS and CRS, the market may lose
the most efficient insurers. For actuaries and
other financial professionals, risk adjustment
means setting a premium or per capita payment
in proportion to the expected expenses of an
individual or group. The RAS is based on risk
adjusters, i.e. observed characteristics of
individuals. Developing a RAS and searching
for appropriate risk adjusters require
empirical testing of their ability to predict
HCE.

Socio-demographic variables can serve as risk
adjusters. Since age and gender have
relatively small explanatory power, other
socio-demographic variables were studied:
marital status, retirement status, disability
status, educational level and income level.
Data from the German health insurance funds
showed that elderly pensioners with
disabilities have significantly higher HCE.
Single retirees and low-income individuals
also show higher HCE.

HCE in previous periods is an obvious
indicator of morbidity: an increase in HCE
leads to an increase in HCE in the next period
by 20-30%. At the same time, the explanatory
power of HCE should be weighed against the
weakening of a person's incentives to reduce
costs, because higher current HCE will to some
extent be compensated to the person later. It
is through HCE that insurers try to identify
favorable risks, and there may be no better
risk adjusters. Prescription medications in
previous periods have also predicted the value
of HCE.

Morbidity can be measured by gathering
available diagnostic information to identify
chronically ill patients and to classify
individuals according to their expected HCE.
This classification can be done by various
methods. Empirical studies show that
diagnostic information gives an accurate
prediction of HCE values. However, gathering
this information can be expensive. Because
insurers have an interest in beneficial
diagnoses for their policyholders, they also
have an interest in how the relevant
information is interpreted (upcoding):
insurers can encourage their policyholders to
consult doctors more often in order to collect
as many diagnoses as possible.

Many countries and health care systems use
diagnostic information to determine the
reimbursement of service providers, which
makes the necessary data available. To process
and analyze these data, software
implementations are being developed for
constructing classifiers, selecting
informative features and processing
heterogeneous medical and biological variables
for scientific research in clinical medicine.

One of the goals of the research is to develop
approaches and software tools that ensure the
reproducibility of the numerical experiments
conducted within the joint project. The goal
of the project is to develop effective methods
and software for constructing classifiers and
selecting informative features, and to create
a prototype of an intelligent analytical
system that implements all stages of data
processing and analysis in software and is
aimed at research in clinical medicine. This
system will integrate clinical and molecular
patient data, determine diagnostic biomarkers
and their combinations, build classifiers of
complex diseases (oncological diseases) based
on integrated data, and identify new disease
subtypes in order to improve treatment methods
and increase their efficiency. The second goal
includes the development of the approaches to

The large amount of research devoted to
mathematical methods of data handling,
particularly classification models, is due, on
the one hand, to the wide range of possible
applications and, on the other hand, to the
complexity of these problems, which requires
the development and improvement of means to
solve them (see, for example, [5-9]). In
addition to general requirements for the
efficiency of the created software, attention
must be paid to the availability of large and
heterogeneous data sets, to the ability to
transfer programs from one hardware platform
to another, and to their performance in cloud
computing.
Moreover, one of the most important
requirements is the reproducibility of
numerical experiments. Reproducibility of
research is one of the basic scientific
principles. However, science has come to
recognize a "reproducibility crisis" [10-11].
This crisis has affected almost all branches
of science, in particular, to a large extent,
biology and medicine. Much effort has been
made recently to overcome it, including the
development of software and software platforms
that ensure the reproducibility of scientific
computing. Computing in biology and medicine
involves high-performance computing
technologies (including clusters and grid
technologies). However, the introduction of
modern technologies for ensuring the
reproducibility of calculations in this area
is quite slow [12, p. 731]. As a result, on
clusters that do not have the appropriate
software installed, there is a contradiction
between modern requirements for the
reproducibility of scientific calculations and
the ability to achieve it by old means.

It so happened that the need to create a 
containerized application was not a planned 
stage of our study. This was primarily due to 
the ways of accessing the real data on which 
the software was tested. Only then did the 
authors realize that they had gained other ad- 
vantages, among which the most important is 
the reproducibility of research numerical ex- 
periments. It is the purpose of the publication 
to share this experience. 

The second purpose is to briefly describe our
efforts to develop specialized computer
methods and models for solving vital tasks in
biomedicine. An enormous amount of biomedical
and clinical data is now collected in public
and private repositories. These data can be
freely accessed and offer a wide field for
experiments with newly developed scientific
approaches and for their comparison. The
integration of heterogeneous information
sources is one of the urgent applied problems
that we have tried to solve in our project.
The hybrid classification model forms the
basis of the intelligent analytical system and
aims to integrate several sources of
biomedical information in order to improve the
diagnosis and prognosis of complex diseases.

Based on the approaches presented in [13-14],
optimization models and methods for
constructing linear classifiers have been
developed. In particular, the problem of
constructing classifiers for linearly
inseparable sets was formulated as the problem
of minimizing the band of incorrect
classification of training sample points. This
model belongs to the class of non-convex
programming problems and is multi-extremal.
Various formulations of this problem are
proposed, and approaches to constructing
approximate solutions and computing estimates
of optimal values are considered. An
interesting geometric interpretation of the
problems of constructing linear classifiers
can be found in [15].

To solve these optimization problems, methods
of non-smooth optimization were used, namely
the r-algorithms of N.Z. Shor [16-17] and
exact penalty functions [18-19]. When creating
the corresponding software, modern linear
algebra libraries similar to [20-22] should be
used to speed up arithmetic operations. This
combination of algorithms based on non-smooth
optimization methods with modern linear
algebra libraries was implemented in the
developed software module NonSmoothSVC.
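
To give a feeling for what training a linear classifier by non-smooth optimization looks like, the sketch below minimizes the regularized hinge loss with a plain subgradient method in NumPy. This is an assumption-laden stand-in: it is neither Shor's r-algorithm nor the NonSmoothSVC module, and it uses the convex hinge-loss formulation rather than the non-convex band-minimization model described above. The function names, step rule and demonstration data are hypothetical.

```python
import numpy as np


def hinge_objective(w, b, X, y, C=1.0):
    """0.5*||w||^2 + C * mean(max(0, 1 - y*(Xw + b))); non-smooth in (w, b)."""
    margins = 1.0 - y * (X @ w + b)
    return 0.5 * float(w @ w) + C * float(np.maximum(margins, 0.0).mean())


def train_linear_classifier(X, y, C=1.0, n_iter=500, step0=0.1):
    """Plain subgradient descent with step0/sqrt(k) steps (illustrative only)."""
    n_samples, n_features = X.shape
    w, b = np.zeros(n_features), 0.0
    best_f, best_w, best_b = hinge_objective(w, b, X, y, C), w.copy(), b
    for k in range(1, n_iter + 1):
        # Points with margin below 1 contribute to the subgradient.
        active = y * (X @ w + b) < 1.0
        grad_w = w - C * (y[active, None] * X[active]).sum(axis=0) / n_samples
        grad_b = -C * y[active].sum() / n_samples
        step = step0 / np.sqrt(k)
        w = w - step * grad_w
        b = b - step * grad_b
        # Subgradient steps are not monotone, so keep the best iterate.
        f = hinge_objective(w, b, X, y, C)
        if f < best_f:
            best_f, best_w, best_b = f, w.copy(), b
    return best_w, best_b


# Hypothetical demonstration data: two well-separated Gaussian clusters.
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(2.0, 0.5, (50, 2)), rng.normal(-2.0, 0.5, (50, 2))])
y_demo = np.concatenate([np.ones(50), -np.ones(50)])

w_demo, b_demo = train_linear_classifier(X_demo, y_demo)
accuracy = float(np.mean(np.sign(X_demo @ w_demo + b_demo) == y_demo))
print("training accuracy:", accuracy)
```

The matrix-vector products `X @ w` are exactly the operations that optimized linear algebra libraries accelerate, which is why the choice of such a library matters for classifiers of this kind.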

To test the capabilities of the new classifier
NonSmoothSVC, it was compared with existing
tools. Methods from the scikit-learn library
[12; 23] were chosen, namely LinearSVC, NuSVC
and AdaBoost. The last two methods are
non-linear classifiers; they were chosen to
obtain additional information about the
advantages of different methods for different
problems. The first numerical experiments were
carried out on specially generated artificial
data.

The computational experiments aimed to establish the
speed and predictive properties of the new software in
comparison with existing tools. Both artificially
generated data and real medical data were used in the
test problems. Training and control samples of the
randomly generated problems were formed as identically
distributed data points in the unit cube of the feature
space R^n. The points of the first class were then
shifted along the first coordinate by the value δ, and
the points of the second class by the value (-1-δ). When
δ>0, the training and control samples are linearly
separable, and when δ<0, they are linearly inseparable.
Next, a rotation (linear transformation) of the space was
performed so that the separating hyperplane depended on
many coordinates. The need to test the new software on
real data led us to package the software module
NonSmoothSVC as a containerized application (using Docker
technology [24]) for use on a personal computer as well
as in cluster, grid, and cloud environments.

Fig. 1. Comparative density distribution
for full data set (n=12000)
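
The generation of the artificial test problems (points in the unit cube, a class-dependent shift along the first coordinate, then a rotation) can be sketched as follows. This is a minimal illustration, not the generator used in the experiments: the function names, the equal class sizes, the uniform sampling, and the use of random plane rotations are our assumptions.

```python
import math
import random


def make_shifted_cube_data(n_points, n_features, delta, seed=0):
    """Two classes drawn uniformly from the unit cube and shifted along x1.

    Class +1 is shifted by delta and class -1 by (-1 - delta), so the gap
    between the classes along the first coordinate is 2*delta
    (a negative delta means the classes overlap).
    """
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n_points):
        label = 1 if i % 2 == 0 else -1
        point = [rng.random() for _ in range(n_features)]
        point[0] += delta if label == 1 else (-1.0 - delta)
        X.append(point)
        y.append(label)
    return X, y


def rotate(X, n_rotations=10, seed=1):
    """Apply a sequence of random plane (Givens) rotations to every point.

    A product of rotations is orthogonal, so separability is preserved while
    the separating hyperplane stops being aligned with one coordinate axis.
    """
    rng = random.Random(seed)
    n = len(X[0])
    X = [p[:] for p in X]
    for _ in range(n_rotations):
        i, j = rng.sample(range(n), 2)
        angle = rng.uniform(0.0, 2.0 * math.pi)
        c, s = math.cos(angle), math.sin(angle)
        for p in X:
            p[i], p[j] = c * p[i] - s * p[j], s * p[i] + c * p[j]
    return X


X, y = make_shifted_cube_data(n_points=200, n_features=5, delta=0.1)
X_rot = rotate(X)
```

Calling the generator with different seeds gives separate training and control samples; the same rotation seed must then be used for both, so that they share one separating hyperplane.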


This made it possible to access real data on the Cancer
Genomics Cloud [25], a specialized cloud platform that
provides free access to genetic and medical databases, in
particular The Cancer Genome Atlas (TCGA) [26], and to
more than 450 public applications designed to analyze
data on this topic. The list can be extended with one's
own applications, data sets, and research results
(currently there are more than one million on this
service), and other researchers can be involved in
projects. The computational experiments demonstrated
that on some data sets NonSmoothSVC has qualitative
advantages over the other methods in the comparison, but
is inferior in speed. In particular, on linearly
separable samples NonSmoothSVC outperformed LinearSVC in
the number of cases with better classification accuracy.

Fig. 2. Comparative density distribution
for imbalanced data set (n=3720)


On the imbalanced samples, NonSmoothSVC only slightly
outperformed LinearSVC on average in the number of cases
with better classification accuracy, but demonstrated an
advantage in some parts of the classification accuracy
scale (Fig. 1-6).

A full description of the numerical experiments and the
test results can be found in the reports (in Ukrainian)
at http://moderninform.icybcluster.org.ua/ais/.

Thanks to its containerized form, the developed software
can be made publicly available as a tool or application
on this and other services for problems of constructing
optimized linear classifiers using modern linear algebra
libraries.

Fig. 3. Comparative density distribution
for large data set (n=7200)


Where technically feasible, parallelization on
microprocessor networks looks promising.

This approach is especially recommended for large data
samples, when the dimension of the feature space is in
the tens of thousands. It is also necessary to take into
account the specifics of the optimization problems in
particular cases. For example, additional requirements
formulated by domain specialists may reduce the number
of informative features.

Processing and studying biomedical data have some
peculiarities. These include, in particular, possible
large errors that arise in the processing of medical
information; the huge number of features that need to be
taken into account, which increases the dimensionality
of the corresponding optimization problems; and missing
measurements, which require specialized methods for
their processing and analysis.

In order to improve the diagnosis and 
treatment of complex diseases, much attention 
is paid to the comprehensive analysis of vari- 
ous biomedical and clinical data to understand 
the processes occurring in the body at the cel- 
lular level and changes caused by the develop- 
ment of the disease. 

It is known that the cause of complex diseases, along
with external factors, is a combination of genetic
failures, so no single genetic mutation can be fixed as
a biomarker. The difficulty also lies in the fact that
individual genetic factors can differ, and individual
cases of the same disease (phenotype) can be caused by
different genetic changes. In addition, when several
mutations act together, the individual effect of each of
them can be rather insignificant and therefore difficult
to detect.

It is also necessary to take into account the high
heterogeneity of a complex disease, i.e. the
heterogeneity of its observed manifestations
(phenotypes).

Recently, the methods of systems biology have become
widely used to study complex diseases. They rely on
knowledge about the interactions between genes, their
products, and small molecules, which together form a
complex network of interactions. This approach makes it
possible to explain the appearance of similar phenotypes
despite different genetic causes, namely through their
interconnection and their influence (dysregulation) on
the same component of the cellular system. Thus, the use
of the interactome in conjunction with other data from
biogenetic studies can contribute to understanding the
processes occurring at the molecular level in complex
diseases. The use of combinations of heterogeneous data
makes it possible to determine dysregulated cellular
pathways, to reveal the relationship between genotype
and phenotype, and to explain the heterogeneity of a
complex disease.

The natural approaches here are to increase the
efficiency of the tools for solving such optimization
problems and to use methods for selecting informative
features. In the works [27-30], attention is paid to the
preliminary preparation of available medical data in
order to select informative features.

In the course of the project, algorithms for
preprocessing biomedical data and extracting biomarkers
from them were developed, including an algorithm for
ranking features by their information content for
classification [23] and an algorithm for identifying
combinations of biomarkers that takes into account the
correlation of features and makes it possible to exclude
their influence.

Moreover, several approaches were analyzed for
identifying a subset of informative features from
several data sources, namely gene expression data and
data on functional and physical interactions of genes
and their products, presented in the form of networks.
Based on the analysis of existing approaches, an
algorithm for identifying a subset of features has been
developed that integrates interactomic and
transcriptomic data to determine functional subnetworks
associated with the disease. Preprocessing of the
biomedical data made it possible to reduce the feature
space and thereby increase the accuracy of the
classification models.

A detailed description of the algorithms and related
information can be found in the report (in Russian) at
http://moderninform.icybcluster.org.ua/ais/.

In one of the numerical experiments, the real data
contained information on the gene expression of cancer
patients (143 observations of 60,483 features) obtained
from The Cancer Genome Atlas (TCGA). From these data,
the simplified feature ranking method proposed by
Novoselova [28] identified the 23 most informative
features for predicting the vital status of patients
diagnosed with glioblastoma. This approach substantially
reduces the numerical difficulties of subsequent data
processing.
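
The idea of reducing tens of thousands of features to a short list by ranking can be sketched as follows. The ranking method of [28] is not described here, so the sketch substitutes a simple stand-in score (the gap between class means divided by the sum of within-class standard deviations); the score, the function name `rank_features`, and the toy data are assumptions made for illustration only.

```python
import statistics


def rank_features(X, y, top_k):
    """Rank features by |mean_1 - mean_0| / (std_1 + std_0); keep top_k indices."""
    n_features = len(X[0])
    scores = []
    for j in range(n_features):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        gap = abs(statistics.fmean(pos) - statistics.fmean(neg))
        spread = statistics.pstdev(pos) + statistics.pstdev(neg)
        scores.append(gap / (spread + 1e-12))  # small constant avoids division by zero
    order = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return order[:top_k], scores


# Toy data: feature 1 separates the classes, features 0 and 2 do not.
X = [[0.1, 5.0, 1.0], [0.3, 5.2, 0.9], [0.2, 4.9, 1.1],
     [0.2, 1.0, 1.0], [0.1, 1.1, 1.1], [0.3, 0.9, 0.9]]
y = [1, 1, 1, 0, 0, 0]
selected, scores = rank_features(X, y, top_k=1)
```

With real expression data the same call would use a larger `top_k` (23 in the experiment above), and the classifier would then be trained only on the selected columns.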

Because various sources of biological information
characterize different changes occurring in the body at
the cellular level during the development of a complex
disease, it is assumed that their combination will
improve the accuracy of diagnosing the disease subtype
and the reliability of the disease prognosis and of the
predicted response to therapy [31]. In addition,
combining heterogeneous data makes it possible to
discover the relationships between various biomedical
entities (genes, proteins, metabolites, etc.) directly
related to the development of the disease, and to
compensate for noise and errors in individual data
sources, thereby obtaining more reliable results.

A common difficulty in this task is how to combine
information from different data sources. In our study,
the methods of interest are those for constructing
classifiers based on various sources of multidimensional
data, which, as a rule, have a heterogeneous
representation. Consequently, the task is to unify this
representation, determine the base classifier, build
classification models on each data source, and select
ways to combine the predicted values obtained with the
constructed models.
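
One simple way of combining the predicted values of classifiers built on separate data sources is a weighted average of their class probabilities (soft voting). The sketch below shows this option only; it is not necessarily the combination rule used in the project, and the function name, the source names, the weights, and the probabilities are hypothetical.

```python
def combine_predictions(per_source_probs, weights=None):
    """Weighted average of class-1 probabilities predicted from each data source."""
    sources = list(per_source_probs)
    if weights is None:
        weights = {name: 1.0 for name in sources}
    total = sum(weights[name] for name in sources)
    n_samples = len(per_source_probs[sources[0]])
    combined = []
    for i in range(n_samples):
        p = sum(weights[name] * per_source_probs[name][i] for name in sources) / total
        combined.append(p)
    labels = [1 if p >= 0.5 else 0 for p in combined]
    return combined, labels


# Hypothetical outputs of three base classifiers for four patients.
probs = {
    "expression": [0.9, 0.4, 0.2, 0.6],
    "copy_number": [0.7, 0.6, 0.1, 0.4],
    "metabolic": [0.8, 0.3, 0.3, 0.2],
}
combined, labels = combine_predictions(
    probs, weights={"expression": 2.0, "copy_number": 1.0, "metabolic": 1.0}
)
```

The weights give a direct handle on how much each data source contributes to the final decision.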

The core of the intelligent analytical system under
development is a hybrid classification model that
combines several sources of biological information about
patients in order to diagnose subtypes of complex
diseases characterized by genetic disorders. The
proposed hybrid model is a classification ensemble with
the following distinctive features:

1. Uniform presentation of information from 
various data sources by constructing a mat- 
rix of object-object distances using various 
kernel functions (density functions), inclu- 
ding Gaussian, polynomial function, scalar 
product of vectors, etc. 

2. Implementation of the procedure for selec- 
ting classification characteristics for each 
individual data source. 

3. Construction of a basic or individual classi- 
fier of a hybrid model, which can be either 
a single classifier or an ensemble of classi- 
fiers built on a single data source. 

4. Implementation of several ways of integra- 
ting individual classifiers of the model. 

5. Analysis of the information content of indi- 
vidual classifiers using the assessment of 
their weight coefficients. 
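
The uniform presentation in item 1 can be illustrated by a short sketch: whatever the number of features in a data source, a kernel function turns it into an object-by-object matrix of the same size. The kernel parameters, the function names, and the toy data are assumptions made for illustration.

```python
import math


def dot(u, v):
    """Scalar product of two vectors."""
    return sum(a * b for a, b in zip(u, v))


def linear_kernel(u, v):
    return dot(u, v)


def polynomial_kernel(u, v, degree=2, coef0=1.0):
    return (dot(u, v) + coef0) ** degree


def gaussian_kernel(u, v, gamma=0.5):
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)


def kernel_matrix(samples, kernel):
    """Object-by-object matrix K[i][j] = kernel(samples[i], samples[j])."""
    return [[kernel(a, b) for b in samples] for a in samples]


# Two hypothetical data sources with different numbers of features
# describing the same three patients.
expression = [[1.0, 0.0, 2.0], [0.5, 1.5, 0.0], [1.0, 0.1, 1.9]]
copy_number = [[2.0, 2.0], [3.0, 1.0], [2.0, 2.0]]

K_expr = kernel_matrix(expression, gaussian_kernel)
K_cnv = kernel_matrix(copy_number, polynomial_kernel)
```

Both matrices are 3 x 3, so classifiers built on different sources receive inputs of identical shape.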

The method for constructing a hybrid 
model is based on a combination of the bag- 
ging procedure and the aggregation of ranked 
lists to build basic classifiers and a pruning 
procedure to determine the final structure of 
the model, which allows adaptively adjusting 
the ensemble taking into account the type of 
classified data. 
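
The combination of bagging with the aggregation of ranked lists can be sketched as follows: each bootstrap replicate produces its own ranked list of features, and the lists are merged into a consensus ranking. The aggregation rule is not specified here, so the sketch uses the Borda count as a simple stand-in; the placeholder ranking criterion and all function names are likewise assumptions, and the pruning step is not shown.

```python
import random


def bootstrap_indices(n, rng):
    """Sample n indices with replacement (one bagging replicate)."""
    return [rng.randrange(n) for _ in range(n)]


def rank_by_mean_gap(X, y):
    """Placeholder ranking: order features by the gap between class means."""
    def gap(j):
        pos = [r[j] for r, l in zip(X, y) if l == 1]
        neg = [r[j] for r, l in zip(X, y) if l == 0]
        if not pos or not neg:  # degenerate replicate containing a single class
            return 0.0
        return abs(sum(pos) / len(pos) - sum(neg) / len(neg))
    return sorted(range(len(X[0])), key=gap, reverse=True)


def borda_aggregate(rankings):
    """Borda count: a feature earns (n - position) points in every ranked list."""
    n = len(rankings[0])
    points = [0] * n
    for ranking in rankings:
        for position, feature in enumerate(ranking):
            points[feature] += n - position
    return sorted(range(n), key=lambda j: points[j], reverse=True)


def bagged_feature_ranking(X, y, n_bags=25, seed=0):
    """Rank features on each bootstrap replicate and aggregate the lists."""
    rng = random.Random(seed)
    rankings = []
    for _ in range(n_bags):
        idx = bootstrap_indices(len(X), rng)
        rankings.append(rank_by_mean_gap([X[i] for i in idx], [y[i] for i in idx]))
    return borda_aggregate(rankings)


X = [[0.1, 5.0, 1.0], [0.3, 5.2, 0.9], [0.2, 4.9, 1.1],
     [0.2, 1.0, 1.0], [0.1, 1.1, 1.1], [0.3, 0.9, 0.9]]
y = [1, 1, 1, 0, 0, 0]
consensus = bagged_feature_ranking(X, y)
```

Aggregating over replicates makes the selected features less sensitive to the particular patients that happen to be in the training sample.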

Preliminary experiments on the TCGA data [26] showed
that ensembles built on heterogeneous data sources can
significantly increase the accuracy of classification
and prediction of subtypes of complex diseases, since
each data source describes the organism under study from
a different perspective: gene expression data,
ribonucleic acid (RNA) sequencing data, metabolic data,
gene copy number data, etc.

Ensuring the reproducibility of calculations is a
prerequisite for the reproducibility of scientific
research as a whole. The conditions for computational
reproducibility are the availability of the source data,
the ability to reproduce an identical computing
environment (or an environment that does not lead to
different calculation results), and the availability of
the results of the computations. Biomedical calculations
have specific features that should be taken into account
when planning them. Let us mention some of them.

Modern biomedical calculations, especially those based
on genome data, are very large and cumbersome. "Classic"
biomedical applications (PAML, Muscle, MAFFT, MrBayes,
BLAST, etc.) and large libraries implementing biomedical
algorithms in different programming languages (C/C++,
Java, R, Go, Scala, Haskell, Perl, Python, Ruby, Erlang,
Julia, etc. [32]) are quite often used simultaneously in
one study. Moreover, biomedical calculations often
involve methods of artificial intelligence (machine
learning, pattern recognition) and the corresponding
libraries (e.g., scikit-learn [6], [17]). Such a variety
of software requires careful configuration of the
computing environment with control over the versions of
the libraries used (dozens or even hundreds of libraries
may be involved).

Otherwise, the results of the calculations may not be
reproducible. On a cluster, creating such environments
(separate for each user) and maintaining them in a
conflict-free state is quite a burdensome task, unless
special software configuration tools such as Conda or
Bioconda are used, or the applications are containerized
using, for example, Singularity. Most of the libraries
and applications used in biomedical computing do not
make efficient use of parallel multithreaded computing
on multi-core processors. At the same time, many of them
fit the "embarrassingly parallel" model, in which
individual pieces of data are processed in parallel by
identical instances of computational processes without
exchanging messages (for example, using Apache Hadoop
technology) [12].
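
The "embarrassingly parallel" model can be illustrated in a few lines: identical workers process independent pieces of data and never communicate with each other. The sketch uses the Python standard library rather than Apache Hadoop, and the per-chunk task (GC content of a sequence) and the function names are placeholders chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def gc_content(sequence):
    """Fraction of G and C bases in one sequence (stand-in for a real per-chunk task)."""
    if not sequence:
        return 0.0
    return sum(base in "GC" for base in sequence) / len(sequence)


def process_chunks(chunks, worker, max_workers=4, executor_cls=ThreadPoolExecutor):
    """Run identical workers on independent chunks; workers never exchange messages."""
    with executor_cls(max_workers=max_workers) as pool:
        return list(pool.map(worker, chunks))  # results come back in input order


chunks = ["ACGT", "GGCC", "ATAT", "GATTACA"]
results = process_chunks(chunks, gc_content)
```

For CPU-bound work, `concurrent.futures.ProcessPoolExecutor` can be passed as `executor_cls`; because the chunks are independent, the same structure carries over to many containers or cluster nodes.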

Taking into account the peculiarities of biomedical
computing, reproducibility and horizontal scaling (the
ability to increase the number of identical computing
units working on one problem) can be achieved through
containerized applications, software pipelines, and
parameterization of the software environment.

Technologies of containerization of software
applications. Containerization of biomedical
applications (with Docker or Singularity) provides
reproducibility of the conditions in which the
calculations took place (an unchanged software
environment, including applications and libraries) and
the possibility of horizontal scaling, provided that the
"embarrassingly parallel" model is used, in cluster
(Singularity) and cloud (Docker) calculations.

Technologies of software pipelining of calculations. A
software pipeline makes it possible to organize workflow
calculations (calculations in which the inputs and
outputs of processes are interconnected). With workflow
automation tools (workflow engines) such as CWL (Common
Workflow Language), GWL (Guix Workflow Language),
Snakemake, and Nextflow, a specific calculation can be
presented as a task description (a text file, usually in
YAML or JSON format), the results of which can be
reproduced [7]. In addition, there are tools for
creating and displaying such tasks as a graph of
processes and data flows. An example of such a tool is
RABIX (Reproducible Analyses for Bioinformatics), a
graphical editor for CWL. Some pipeline tools (for
example, CWL) also use containerization, so such tasks
can be performed both on a personal computer and in a
cloud environment. An important feature of workflow
automation tools is that the task description syntax
makes it possible to specify the scale of the
calculations by indicating the amount of resources
required. Seven Bridges' product, the Cancer Genomics
Cloud (CGC, see http://www.cancergenomicscloud.org/), is
an example of a cloud software platform for performing
reproducible biomedical computations using
containerization and pipelining. It was the use of
containerization in the creation of an application for
constructing a linear classifier at the V.M. Glushkov
Institute of Cybernetics of the National Academy of
Sciences of Ukraine that made it possible to test it on
very voluminous real medical data located at the CGC.
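
What such a task description may look like can be sketched by generating a minimal CWL tool description in JSON form. The structure follows the CWL CommandLineTool format (a Docker requirement naming the container image and a resource requirement giving cores and memory), but the image name, the command, the input and output names, and the helper `make_cwl_tool` are hypothetical and do not describe the application actually deployed on the CGC.

```python
import json


def make_cwl_tool(docker_image, base_command, cores, ram_mb):
    """Build a minimal CWL v1.2 CommandLineTool description as a Python dict."""
    return {
        "cwlVersion": "v1.2",
        "class": "CommandLineTool",
        "baseCommand": base_command,
        "requirements": {
            # Container image in which the step runs.
            "DockerRequirement": {"dockerPull": docker_image},
            # Resources requested for the step (ramMin is given in mebibytes).
            "ResourceRequirement": {"coresMin": cores, "ramMin": ram_mb},
        },
        "inputs": {
            "training_data": {
                "type": "File",
                "inputBinding": {"position": 1},
            },
        },
        "outputs": {
            "model": {
                "type": "File",
                "outputBinding": {"glob": "model.json"},
            },
        },
    }


tool = make_cwl_tool(
    docker_image="example/nonsmoothsvc:latest",  # hypothetical image name
    base_command=["python", "train.py"],         # hypothetical entry point
    cores=4,
    ram_mb=8192,
)
cwl_text = json.dumps(tool, indent=2)
```

Because the description is plain text, it can be stored under version control next to the results it produced.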

Technologies for parameterization of 
software environment. Parameterization of the 
software environment allows you to repro- 
duce, if necessary, an identical computing en- 
vironment. GNU Guix, Conda, Bioconda are 
examples of tools that allow you to create an 
isolated software environment for individual 
users in a cluster [12]. 
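
A first step towards a parameterized environment is simply recording what is installed. The sketch below captures the Python interpreter version, the platform, and the versions of installed Python packages. It is not how GNU Guix or Conda work (those tools pin and rebuild the whole environment); it only shows, under that stated limitation, the kind of information such a description has to contain. The function name `environment_snapshot` is hypothetical.

```python
import json
import platform
from importlib import metadata


def environment_snapshot():
    """Record interpreter, platform, and installed Python package versions."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip broken installations without a name
            packages[name] = dist.version
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": dict(sorted(packages.items())),
    }


snapshot = environment_snapshot()
snapshot_text = json.dumps(snapshot, indent=2)
```

Comparing two such snapshots shows at a glance whether two runs used the same library versions.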

At present, there is a range of technologies for
ensuring the reproducibility of scientific calculations
in cloud and cluster environments. This makes it
possible to create biomedical applications adapted to
these environments. As a result, we obtain a
computational basis that satisfies modern requirements
for computational reproducibility.

The experience of using the developed linear classifier,
gained during its testing on artificial and real data,
points to several advantages of the containerized form
of the created application: it provides access to real
data located in a cloud environment; calculations for
research problems can be performed on cloud resources
both with the developed tools and with the cloud
services; such an organization of research makes
numerical experiments reproducible, i.e. any other
researcher can compare the results of their own
developments on specific data that have already been
studied by others, in order to verify the conclusions
and the technical feasibility of new results; and the
developed tools can be used on technical devices of
various classes, from a personal computer to a powerful
cluster.

Conclusions 

The next steps of the project include the development of
a common software interface for the experimental
prototype of the intelligent analytical system in order
to integrate the developed methods and software modules
for biomedical data preprocessing, clustering, and
classification. It will make it possible to perform all
the steps of data analysis within a single framework and
to conduct research in the field of biomedicine. The
hybrid classification model, as the core of the
intelligent system, will make it possible to integrate
multidimensional, heterogeneous biomedical data in order
to better understand the molecular causes of disease
origin and development and to improve the identification
of disease subtypes and disease prognosis. Much
attention will be paid to experimenting with different
computational approaches on real datasets, taking the
reproducibility of results into account.